library(networkD3)
library(dplyr)
library(tidyr)
library(stringr)
library(RColorBrewer)
library(pheatmap)
library(arules)
library(arulesViz)
Cargo theft presents a significant threat to supply chain security and affects both economic stability and public safety. As defined by the Federal Bureau of Investigation (FBI) [1], cargo theft includes, but is not restricted to, the theft of items, personal property, cash, or luggage that makes up, entirely or partially, a commercial freight shipment in transit. The FBI monitors cargo theft cases throughout the United States and gathers comprehensive yearly data to recognize patterns and trends.
This paper applies two association rule mining techniques, the Apriori and ECLAT algorithms, to identify frequent patterns and relationships in the 2023 cargo theft incident data. The aim is to identify and understand the key factors associated with the value of stolen cargo, and in particular the patterns or scenarios associated with low, moderate, high, and very high stolen-value incidents. The analysis covers several dimensions of cargo theft, including the locations prone to it, the financial losses, the types of cargo most frequently stolen, and the characteristics and demographic patterns of offenders. Some rare patterns linked to particular stolen-value levels may also emerge. By relating theft incidents to the value of the stolen goods, with visualizations to present the findings clearly and concisely, the analysis is expected to suggest actionable strategies for reducing theft.
Cargo in USA (Image source retrieved from FBI’s official website)
To make the data suitable for association rule mining, it is loaded in basket format, which produces a sparse matrix of transactions.
transactions <- read.transactions("CT2023.csv", format = "basket", sep = ",", skip = 1)
selections <- transactions[, itemFrequency(transactions) > 0.2]
itemFrequencyPlot(transactions, topN = 10, type = "absolute", main = "Item Frequency (Absolute)", ylab = "Item Frequency", xlab = "Item Name", col = "darkred")
round(dissimilarity(selections, which = "items"), 2)
##                          All Other Larceny Business Individual  Low    M
## Business                              0.81
## Individual                            0.83     1.00
## Low                                   0.81     0.82       0.42
## M                                     0.85     0.80       0.76 0.72
## Not Hispanic or Latino                0.83     0.85       0.81 0.79 0.57
## Not recovered                         0.77     0.75       0.38 0.34 0.73
## Residence/Home                        0.88     0.98       0.69 0.77 0.85
## South Atlantic                        0.86     0.86       0.67 0.70 0.82
## Theft From Motor Vehicle              1.00     0.89       0.76 0.75 0.92
## Young Adult                           0.86     0.84       0.81 0.78 0.57
##                          Not Hispanic or Latino Not recovered Residence/Home
## Business
## Individual
## Low
## M
## Not Hispanic or Latino
## Not recovered                              0.81
## Residence/Home                             0.88          0.77
## South Atlantic                             0.83          0.67           0.75
## Theft From Motor Vehicle                   0.94          0.75           0.89
## Young Adult                                0.61          0.79           0.88
##                          South Atlantic Theft From Motor Vehicle
## Business
## Individual
## Low
## M
## Not Hispanic or Latino
## Not recovered
## Residence/Home
## South Atlantic
## Theft From Motor Vehicle           0.83
## Young Adult                        0.86                     0.94
pheatmap(as.matrix(dissimilarity(selections, which = "items")), symm = TRUE, main = "Item Dissimilarity Heatmap", color = colorRampPalette(brewer.pal(9, "Paired"))(50), border_color = NA)
Before applying association rules, it is worth understanding the structure of the dataset, since this improves the quality of the patterns found later. The data contain a large number of items and transactions. Jaccard dissimilarity, a useful metric for sparse data, measures for each pair of items how rarely they appear in the same incident: one minus the number of incidents containing both items divided by the number containing at least one of them. Grouping items by this measure reveals clusters of items that behave similarly. Only items that appear in more than 20% of transactions are kept, so the matrix covers the frequent items. Lower values in the dissimilarity matrix indicate items that often occur together, while higher values suggest weaker association. The results show several item pairs with high dissimilarity, which are unlikely to appear together as antecedents in the association rules. “Individual”, “Low” and “Not recovered” co-occur frequently (0.34 to 0.42). This could mean that cargo with a “Low” stolen value often cannot be recovered after investigation and that the victims are usually individuals. “Business” and “Theft From Motor Vehicle” (0.89) rarely occur together. Perhaps incidents with business victims seldom involve theft from motor vehicles; one possible explanation is that business cargo is large and therefore difficult to take from a vehicle. The heat map visualizes the same dissimilarity matrix. Clusters of similar items appear as blocks of similar colors, which helps identify related items visually.
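The Jaccard dissimilarity between two items can be illustrated with a small base R sketch. The incidence matrix below is invented for illustration and is not taken from CT2023.csv; the helper name `jaccard_dissimilarity` is likewise hypothetical.

```r
# Toy incidence matrix: rows are incidents (transactions), columns are items.
# The values are made up for illustration only.
incidents <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 1, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Low", "Not recovered", "Business"))
)

# Jaccard dissimilarity between two items:
# 1 - (incidents containing both) / (incidents containing at least one).
jaccard_dissimilarity <- function(m, a, b) {
  both   <- sum(m[, a] == 1 & m[, b] == 1)
  either <- sum(m[, a] == 1 | m[, b] == 1)
  1 - both / either
}

d_low_notrec   <- jaccard_dissimilarity(incidents, "Low", "Not recovered")
d_low_business <- jaccard_dissimilarity(incidents, "Low", "Business")
```

In this toy data, "Low" and "Not recovered" share two of the four incidents in which either appears, so their dissimilarity is lower than that of "Low" and "Business", which share only one of four.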
itemsets <- eclat(transactions, parameter = list(supp = 0.25, maxlen = 10))
inspect(sort(itemsets, by = "support", decreasing = TRUE))
## items support count
## [1] {Not recovered} 0.8459867 29585
## [2] {Low} 0.6782477 23719
## [3] {Individual} 0.6750450 23607
## [4] {Low, Not recovered} 0.6040719 21125
## [5] {Individual, Not recovered} 0.5848560 20453
## [6] {Individual, Low} 0.4982986 17426
## [7] {Individual, Low, Not recovered} 0.4550056 15912
## [8] {South Atlantic} 0.3397958 11883
## [9] {M} 0.3289583 11504
## [10] {Not recovered, South Atlantic} 0.2925567 10231
## [11] {Business} 0.2764004 9666
## [12] {Individual, South Atlantic} 0.2518086 8806
The ECLAT algorithm is used here for preliminary mining. Frequent itemsets are first explored by adjusting the minimum support and the maximum itemset size. High-support itemsets indicate common patterns in cargo theft incidents. The dissimilarity matrix already showed that “Low”, “Not recovered” and “Individual” are highly similar. The results confirm this with frequent combinations such as {Low, Not recovered} (0.604) and {Individual, Low, Not recovered} (0.455), which implies that low-value cases involving individuals often end with the items not being recovered. Geographically, around 34% of incidents occurred in the South Atlantic region.
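ECLAT works on a vertical layout in which each item is stored with the ids of the transactions that contain it, and the support of an itemset is obtained by intersecting those id lists. The following base R sketch shows only that core idea on invented data; it is not the arules implementation, and `itemset_support` is a hypothetical helper.

```r
# Vertical layout: each item maps to the ids of the incidents that contain it.
# Toy data, invented for illustration.
tid_lists <- list(
  "Low"           = c(1, 2, 3, 5),
  "Not recovered" = c(1, 2, 4, 5),
  "Individual"    = c(1, 3, 5)
)
n_incidents <- 5

# Support of an itemset = size of the intersection of its tid-lists / n.
itemset_support <- function(items, tids, n) {
  common <- Reduce(intersect, tids[items])
  length(common) / n
}

supp_pair   <- itemset_support(c("Low", "Not recovered"), tid_lists, n_incidents)
supp_triple <- itemset_support(c("Individual", "Low", "Not recovered"),
                               tid_lists, n_incidents)
```

Adding an item can only shrink the intersection, which is why the support of the three-item set cannot exceed that of the pair, as in the reported output (0.455 versus 0.604).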
closed_set <- eclat(transactions, parameter = list(supp = 0.001, maxlen = 5, target = "closed frequent itemsets"))
inspect(sort(closed_set, by = "support", decreasing = FALSE)[1:5])
## items support count
## [1] {East North Central,
## FAIRFIELD,
## MSA counties from 25,000 thru 99,999,
## Ohio} 0.001000829 35
## [2] {Drug Equipment,
## Not recovered} 0.001000829 35
## [3] {Individual,
## Pets} 0.001000829 35
## [4] {California,
## Cities from 250,000 thru 499,999,
## Not recovered,
## Pacific,
## SAN JOAQUIN} 0.001000829 35
## [5] {Arizona,
## Low,
## Mountain,
## MSA counties from 25,000 thru 99,999,
## YUMA} 0.001000829 35
A closed frequent itemset is a frequent itemset for which adding any extra item would decrease its support. Such itemsets represent specific patterns. Because the result is sorted by ascending support, the five itemsets shown all sit at the minimum threshold: each has a support of 0.001000829, meaning it appears in about 0.1% of transactions (35 incidents), or roughly once in every 1,000 transactions. At the same time, restricting the search to closed itemsets reduces redundancy and focuses on unique and informative itemsets, and the co-occurrence of some items makes them interesting candidates for association rules. As a case in point, pets appear among the stolen cargo items. These itemsets tend to involve specific geographical locations, demographic groups, and cargo item categories.
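The closedness condition can be checked by hand on a small example. The sketch below uses invented itemsets and counts and plain base R; in practice the `target = "closed frequent itemsets"` option used above (or `is.closed()` in arules) does this work.

```r
# Toy itemsets with their counts; invented for illustration.
itemsets <- list(
  c("A"), c("B"), c("A", "B"), c("A", "C"), c("A", "B", "C")
)
counts <- c(10, 8, 8, 4, 4)

# An itemset is closed when no proper superset has the same count.
is_closed <- vapply(seq_along(itemsets), function(i) {
  supersets <- vapply(seq_along(itemsets), function(j) {
    j != i &&
      length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]])
  }, logical(1))
  !any(counts[supersets] == counts[i])
}, logical(1))

names(is_closed) <- vapply(itemsets, paste, character(1), collapse = ",")
is_closed
```

Here {B} is not closed because {A, B} occurs just as often, so reporting {B} separately would add no information; the same holds for {A, C} and {A, B, C}.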
The “ruleInduction” function generates rules from the frequent itemsets identified by ECLAT, which allows a brief review of the strongest patterns.
rules <- ruleInduction(itemsets, transactions, confidence = 0.5)
inspect(sort(rules, by = "confidence", decreasing = TRUE)[1:10])
## lhs rhs support confidence
## [1] {Individual, Low} => {Not recovered} 0.4550056 0.9131183
## [2] {Low} => {Not recovered} 0.6040719 0.8906362
## [3] {Individual} => {Not recovered} 0.5848560 0.8663956
## [4] {South Atlantic} => {Not recovered} 0.2925567 0.8609779
## [5] {Individual, Not recovered} => {Low} 0.4550056 0.7779788
## [6] {Low, Not recovered} => {Individual} 0.4550056 0.7532308
## [7] {South Atlantic} => {Individual} 0.2518086 0.7410587
## [8] {Individual} => {Low} 0.4982986 0.7381709
## [9] {Low} => {Individual} 0.4982986 0.7346853
## [10] {Not recovered} => {Low} 0.6040719 0.7140443
## lift itemset
## [1] 1.079353 3
## [2] 1.052778 6
## [3] 1.024124 4
## [4] 1.017720 1
## [5] 1.147042 3
## [6] 1.115823 3
## [7] 1.097791 2
## [8] 1.088350 5
## [9] 1.088350 5
## [10] 1.052778 6
Rules with high confidence, such as the first few, indicate that the consequent appears in most incidents matching the antecedent. For instance, when “Low” is present, “Not recovered” is very likely to appear (confidence 0.89), and the reverse rule also holds with lower confidence (0.71). However, the lift values are only slightly above 1 (1.02 to 1.15), because “Not recovered” and “Low” are common in the data as a whole, so much of the confidence reflects their base rates rather than a strong dependence. The rules from LHS to RHS provide some insight into how the presence of certain items relates to the occurrence of others in the transaction dataset. Yet the rules are general and not diverse, which makes it hard to interpret and evaluate the factors associated with different stolen-value levels. Therefore, the Apriori algorithm is used for more specific rule mining by adjusting the support and confidence levels.
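As a worked check, the confidence and lift of the rule {Low} => {Not recovered} can be recomputed from the supports in the ECLAT output. The sketch below uses only the reported figures and the standard definitions (confidence = support of the whole rule divided by support of the LHS; lift = confidence divided by support of the RHS).

```r
# Values copied from the ECLAT output above.
supp_low         <- 0.6782477  # support({Low})
supp_not_rec     <- 0.8459867  # support({Not recovered})
supp_low_not_rec <- 0.6040719  # support({Low, Not recovered})

# Rule {Low} => {Not recovered}
confidence <- supp_low_not_rec / supp_low  # share of Low incidents not recovered
lift       <- confidence / supp_not_rec    # confidence relative to the base rate

round(confidence, 4)
round(lift, 4)
```

The results agree with the reported 0.8906 and 1.0528 up to rounding. Since about 85% of all incidents are "Not recovered" anyway, a confidence of 89% is only a small improvement over the base rate, which is what a lift close to 1 expresses.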
After obtaining insights (frequent itemsets) from ECLAT, we can use Apriori to create targeted association rules for the different categories of stolen value. The consequent (RHS) is restricted to one stolen-value category at a time: “Low”, “Moderate”, “High”, or “Very High”.
ap_Low <- apriori(transactions, parameter = list(supp = 0.1, conf = 0.5), appearance = list(rhs = c("Low"), default = "lhs"), control = list(verbose = F))
ap_Low <- sort(ap_Low, by = "confidence", decreasing = TRUE)
inspect(head(ap_Low))
## lhs rhs support confidence coverage lift count
## [1] {Individual,
## Not recovered,
## Theft From Motor Vehicle} => {Low} 0.1422607 0.8478187 0.1677962 1.250013 4975
## [2] {Individual,
## Theft From Motor Vehicle} => {Low} 0.1485517 0.8397995 0.1768894 1.238190 5195
## [3] {Not recovered,
## Theft From Motor Vehicle} => {Low} 0.1740871 0.8003155 0.2175231 1.179975 6088
## [4] {Individual,
## M,
## Not recovered} => {Low} 0.1229018 0.7948955 0.1546138 1.171984 4298
## [5] {Black or African American,
## Not recovered} => {Low} 0.1145807 0.7934653 0.1444054 1.169875 4007
## [6] {Theft From Motor Vehicle} => {Low} 0.1822653 0.7921949 0.2300763 1.168002 6374
plot(ap_Low, measure = c("support", "lift"), shading = "confidence", main = "Scatter Plot for 48 Rules on Contributing Low Stolen Values")
From the results, high confidence values (around 80% or more for the top rules) suggest strong relationships between the left-hand side (lhs) conditions and the right-hand side (“Low”). The strongest association is between the combination of “Individual”, “Not recovered”, and “Theft From Motor Vehicle” and a low stolen cargo value. For example, according to the first rule, when cargo items owned by individuals are stolen from motor vehicles and are ultimately not recovered, there is an 84.78% chance that the stolen value is low. The lifts of the rules are all higher than 1 (about 1.17 to 1.25 for the top six) and the counts are large (several thousand incidents per rule), so the rules are common, although the uplift over the 67.8% base rate of “Low” is modest. These three conditions are associated with low cargo theft values.
In the scatter plot, the shading represents the confidence of the rules, that is, the likelihood that the rhs occurs given the lhs. Darker shades indicate higher confidence and lighter shades indicate lower confidence. The points that are high on the y-axis (high lift) tend to have dark shading (high confidence), although they are somewhat spread out. This still indicates that the rules are frequent and associated with low stolen cargo values.
ap_Moderate_1 <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.08), appearance = list(rhs = c("Moderate"), default = "lhs"), control = list(verbose = F))
ap_Moderate_1 <- sort(ap_Moderate_1, by = "confidence", decreasing = TRUE)
inspect(head(ap_Moderate_1))
## lhs rhs support confidence coverage lift count
## [1] {COCONINO,
## Department/Discount Store} => {Moderate} 0.001029424 1 0.001029424 7.363866 36
## [2] {Clothes/ Furs,
## COCONINO} => {Moderate} 0.001029424 1 0.001029424 7.363866 36
## [3] {COCONINO,
## Not Hispanic or Latino} => {Moderate} 0.001058020 1 0.001058020 7.363866 37
## [4] {Arizona,
## COCONINO,
## Department/Discount Store} => {Moderate} 0.001029424 1 0.001029424 7.363866 36
## [5] {Clothes/ Furs,
## COCONINO,
## Department/Discount Store} => {Moderate} 0.001029424 1 0.001029424 7.363866 36
## [6] {Cities from 50,000 thru 99,999,
## COCONINO,
## Department/Discount Store} => {Moderate} 0.001029424 1 0.001029424 7.363866 36
ap_Moderate <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.2), appearance = list(rhs = c("Moderate"), default = "lhs"), control = list(verbose = F))
ap_Moderate <- sort(ap_Moderate, by = "confidence", decreasing = TRUE)
inspect(head(ap_Moderate))
## lhs rhs support confidence coverage lift count
## [1] {Not recovered,
## Tools} => {Moderate} 0.01295359 0.2843691 0.04555203 2.094056 453
## [2] {Tools} => {Moderate} 0.01375425 0.2722128 0.05052758 2.004539 481
## [3] {Mountain,
## Young Adult} => {Moderate} 0.01055160 0.2314931 0.04558062 1.704684 369
## [4] {Cities 1000000 or over,
## Nevada,
## Not recovered} => {Moderate} 0.01481227 0.2259049 0.06556861 1.663534 518
## [5] {Cities 1000000 or over,
## CLARK,
## Not recovered} => {Moderate} 0.01481227 0.2259049 0.06556861 1.663534 518
## [6] {Cities 1000000 or over,
## Mountain,
## Not recovered} => {Moderate} 0.01481227 0.2259049 0.06556861 1.663534 518
plot(ap_Moderate_1, measure = c("support", "lift"), shading = "confidence", main = "Scatter Plot for 18536 Rules on Contributing Moderate Stolen Values")
plot(ap_Moderate, measure = c("support", "lift"), shading = "confidence", main = "Scatter Plot for 31 Rules on Contributing Moderate Stolen Values")
When support and confidence are set to 0.01 and 0.8 respectively, no rules are found, which implies that the thresholds are too restrictive. Such a strict threshold filters out most associations because it is hard to achieve such high confidence consistently. When the thresholds are set to 0.001 and 0.08, rare events are detected. Many rules have perfect confidence (1) and a very high lift of more than 7. A higher lift means a stronger positive association. While the confidence is perfect, the support is extremely low (only 36 or 37 transactions), so each rule applies to very few cases and may lack generalizability. These could be anomalies or niche patterns. All of the top high-confidence rules involve Coconino County, and “Department/Discount Store” and “Clothes/ Furs” appear repeatedly among them. Stores of this type may carry products that are commonly targeted for moderate-value thefts.
However, since these rules apply to small subsets (extremely low support), they may reflect a bias toward a handful of incidents. The scatter plot shows the very large number of rules generated (18,536), many of which are rare and may be due to chance. The parameters should therefore be amended. With support and confidence set to 0.01 and 0.2, there is a better balance, producing a manageable number of rules (31) with moderate reliability. Lift values above 1 indicate a positive association, although lift alone does not establish statistical significance. Confidence is now 20% to 28%, which is more realistic for predictive use. The high-confidence rules (100%) from the earlier set were rare and too specific. Higher counts and support, with lower lift, indicate that the rules are more generalizable while still showing an association with moderate stolen values. “Tools” has the highest confidence in this set, with lift above 2, meaning incidents involving tools are about twice as likely as the average incident to fall in the moderate range. Rules involving the “Not recovered” status also appear repeatedly, possibly because goods such as tools are difficult to track, although “Not recovered” is common across the whole dataset. Additionally, “Cities 1000000 or over” indicates that large cities with a population over 1 million have a logistics security problem, especially for cargo theft of moderate value.
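Whether an observed lift is more than chance can be checked with a significance test. The sketch below applies Fisher's exact test from base R to the rule {Tools} => {Moderate}. The counts are reconstructed from the reported supports, confidences, and lifts rather than read from the data, so they should be treated as approximate.

```r
# Counts reconstructed from the reported output; treat them as approximate.
n_total    <- round(29585 / 0.8459867)  # incidents: count / support of {Not recovered}
n_tools    <- round(481 / 0.2722128)    # incidents with "Tools": count / confidence
n_moderate <- round(n_total / 7.363866) # "Moderate" incidents: base rate = 1 / lift
n_both     <- 481                       # "Tools" and "Moderate" together

# 2 x 2 contingency table: Tools (yes/no) by Moderate (yes/no).
tab <- matrix(
  c(n_both,              n_tools - n_both,
    n_moderate - n_both, n_total - n_tools - n_moderate + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(Tools = c("yes", "no"), Moderate = c("yes", "no"))
)

# One-sided test: are "Tools" incidents more often "Moderate" than the rest?
test <- fisher.test(tab, alternative = "greater")
test$p.value
```

When thousands of rules are screened at once, some will pass such a test by chance, so a correction for multiple comparisons would be advisable; arules also offers `is.significant()` for testing a whole rule set.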
ap_High <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.4), appearance = list(rhs = c("High"), default = "lhs"), control = list(verbose = F))
ap_High <- sort(ap_High, by = "confidence", decreasing = TRUE)
inspect(head(ap_High))
## lhs rhs support confidence coverage lift count
## [1] {Automobile,
## Individual,
## Recovered,
## South Atlantic} => {High} 0.01040862 0.7338710 0.01418318 5.091093 364
## [2] {Automobile,
## Individual,
## Maryland} => {High} 0.01038003 0.7303823 0.01421178 5.066891 363
## [3] {Automobile,
## Individual,
## Maryland,
## South Atlantic} => {High} 0.01038003 0.7303823 0.01421178 5.066891 363
## [4] {Automobile,
## Individual,
## Motor Vehicle Theft,
## South Atlantic} => {High} 0.01229590 0.6969206 0.01764319 4.834757 430
## [5] {Automobile,
## Recovered,
## South Atlantic} => {High} 0.01218152 0.6949429 0.01752881 4.821037 426
## [6] {Automobile,
## Maryland} => {High} 0.01215293 0.6944444 0.01750021 4.817579 425
plot(ap_High, measure = c("support", "lift"), shading = "confidence", main = "Scatter Plot for 50 Rules on Contributing High Stolen Values")
When the thresholds are set higher, no rules are found, so support and confidence are set to 0.01 and 0.4 respectively. The results show that automobile thefts from individuals in the South Atlantic, even when the vehicle is recovered, are highly likely to be High-value incidents. South Atlantic and Maryland emerge as geographical hotspots, which implies that those areas are high-risk zones for High-value cargo theft. Surprisingly, automobiles are the most common item in high-value cargo thefts, yet they are sometimes recovered. The victims are usually individual owners. In the scatter plot, confidence reaches at most about 0.75 and lift ranges from 2.5 to 5.5, which indicates a very strong association between antecedent and consequent.
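Rules 2 and 3 in the output above have identical support and confidence, because every Maryland incident is also a South Atlantic incident. One common way to prune such duplicates is to drop a rule when a more general rule (a subset of its antecedent, same consequent) already reaches the same or higher confidence. The base R sketch below applies that idea to three of the reported rules; it is an illustration of the criterion, not the code used in this report. In arules, `is.redundant()` serves this purpose.

```r
# Three of the reported "High" rules: antecedent items and confidence.
rules <- list(
  list(lhs = c("Automobile", "Maryland"), conf = 0.6944444),
  list(lhs = c("Automobile", "Individual", "Maryland"), conf = 0.7303823),
  list(lhs = c("Automobile", "Individual", "Maryland", "South Atlantic"),
       conf = 0.7303823)
)

# A rule is flagged when some rule with a strictly smaller antecedent
# (a proper subset) already reaches the same or higher confidence.
is_redundant <- vapply(seq_along(rules), function(i) {
  any(vapply(seq_along(rules), function(j) {
    j != i &&
      length(rules[[j]]$lhs) < length(rules[[i]]$lhs) &&
      all(rules[[j]]$lhs %in% rules[[i]]$lhs) &&
      rules[[j]]$conf >= rules[[i]]$conf
  }, logical(1)))
}, logical(1))

is_redundant
```

Only the third rule is flagged: adding "South Atlantic" to a rule that already contains "Maryland" does not change its confidence.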
ap_Very_High <- apriori(transactions, parameter = list(supp = 0.005, conf = 0.08), appearance = list(rhs = c("Very High"), default = "lhs"), control = list(verbose = F))
ap_Very_High <- sort(ap_Very_High, by = "confidence", decreasing = TRUE)
inspect(head(ap_Very_High))
## lhs rhs support
## [1] {Business, Texas} => {Very High} 0.005347288
## [2] {Business, Texas, West South Central} => {Very High} 0.005347288
## [3] {Business, West South Central} => {Very High} 0.005375883
## [4] {Texas} => {Very High} 0.007978039
## [5] {Texas, West South Central} => {Very High} 0.007978039
## [6] {Business, Commercial/Office Building} => {Very High} 0.006462497
## confidence coverage lift count
## [1] 0.2586445 0.02067427 6.186770 187
## [2] 0.2586445 0.02067427 6.186770 187
## [3] 0.2520107 0.02133196 6.028090 188
## [4] 0.1841584 0.04332161 4.405064 279
## [5] 0.1841584 0.04332161 4.405064 279
## [6] 0.1816720 0.03557233 4.345590 226
plot(ap_Very_High, measure = c("support", "lift"), shading = "confidence", main = "Scatter Plot for 28 Rules on Contributing Very High Stolen Values")
The first rule shows that when the victim is a business and the incident occurs in Texas, there is a 25.86% chance (confidence) that the stolen value is over 50000 USD (Very High). Theft involving businesses in Texas is 6.19 times (lift) more likely to involve very high-value cargo than a theft incident in general. Rule 2 is the same as Rule 1 with the broader US region (West South Central) added; its support and confidence are identical because Texas lies within that region. Rule 3 drops the state and keeps only the region: businesses in West South Central show almost the same confidence (25.2%), but the rule covers just one more incident than Rule 1 (188 versus 187), so the regional pattern is driven almost entirely by Texas. Very high-value cargo owned by businesses therefore faces a security risk in Texas in particular. The lift values of Rules 1 to 3 are higher than the rest, which indicates strong associations that are unlikely to be random. When the thresholds are increased, for example to capture rules that apply to 1%, 2%, or more of the data, the algorithm finds no rules that meet them. The theft of very high-value cargo (more than 50000 USD) is rare, but when it happens, it is strongly associated with specific factors such as businesses in Texas. Low-support rules are still valuable because they highlight specific high-risk scenarios even if they are not widespread.
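Because lift is the confidence divided by the support of the consequent, the overall share of each value category can be recovered from any reported rule as confidence / lift. The sketch below does this for the top rule of each category, using only figures from the outputs above.

```r
# Confidence and lift of the top rule for each value category, as reported above.
top_rules <- data.frame(
  category   = c("Low", "Moderate", "High", "Very High"),
  confidence = c(0.8478187, 0.2843691, 0.7338710, 0.2586445),
  lift       = c(1.250013, 2.094056, 5.091093, 6.186770)
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift.
top_rules$base_rate <- top_rules$confidence / top_rules$lift
top_rules$share_pct <- round(100 * top_rules$base_rate, 1)

top_rules
sum(top_rules$base_rate)
```

The implied shares are roughly 67.8% Low, 13.6% Moderate, 14.4% High, and 4.2% Very High, and they sum to about 1, which is a useful consistency check. It also explains why a confidence of about 26% for "Very High" is notable while the same figure would be unremarkable for "Low".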
The Sankey diagram above shows the flow of relationships between items in the rules produced by Apriori, grouped by the four levels of stolen cargo value. The flow goes from the condition (LHS) to the outcome (RHS), that is, from the factors to the stolen-value level they are associated with. The thicker the link, the higher the confidence of the rule examined previously; a thinner link shows a weaker association. Some nodes, such as “Not recovered” and “Individual”, connect to the same RHS, which indicates that different conditions can lead to the same outcome. The links to “Very High” and “Moderate” are thin, while the others are thick. General rules are therefore easy to find for “High” and “Low” value cargo theft, but hidden patterns are harder to discover and predict for “Very High” and “Moderate” value thefts.
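The code that produced the Sankey diagram is not shown in this report. As a rough sketch of how such a diagram could be assembled with networkD3, the example below builds the node and link tables from four of the rules reported earlier, using confidence as the link width. The choice of rules and the layout settings are assumptions made for illustration only.

```r
# A few rules reported above, flattened to one link per rule (LHS -> RHS).
links_raw <- data.frame(
  lhs        = c("Theft From Motor Vehicle", "Tools", "Automobile, Maryland", "Texas"),
  rhs        = c("Low", "Moderate", "High", "Very High"),
  confidence = c(0.7921949, 0.2722128, 0.6944444, 0.1841584),
  stringsAsFactors = FALSE
)

# networkD3 expects a node table plus zero-based source/target indices.
nodes <- data.frame(name = unique(c(links_raw$lhs, links_raw$rhs)),
                    stringsAsFactors = FALSE)
links <- data.frame(
  source = match(links_raw$lhs, nodes$name) - 1,
  target = match(links_raw$rhs, nodes$name) - 1,
  value  = links_raw$confidence
)

# Draw the diagram only if networkD3 is installed; link width follows `value`.
if (requireNamespace("networkD3", quietly = TRUE)) {
  networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source", Target = "target",
    Value = "value", NodeID = "name",
    fontSize = 12, nodeWidth = 30
  )
}
```

Rules with several antecedent items are shown here as a single combined node; splitting them into one link per item is an alternative that makes shared conditions such as "Not recovered" easier to see.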
To conclude, ECLAT is first used to identify frequent itemsets in cargo theft incidents. The initial findings show the items that recur in the association rules, such as “Individual” victims associated with “Low” and “High” stolen values, and “Not recovered” items associated with “Low” and “Moderate” stolen values. These patterns indicate that low-value cargo thefts are less likely to result in recovery and typically involve individual victims, while high-value cargo thefts occur notably in the South Atlantic area and also affect individual victims. Closed itemsets are also explored; they represent unique and highly specific patterns in the data, but those examined occur rarely, with extremely low support.
The Apriori algorithm is then used with varying support and confidence thresholds to generate association rules. Cargo thefts involving individuals and theft from motor vehicles tend to result in low stolen values, while large cities with more than 1 million population are associated with moderate-value thefts, and the stolen property is often tools. Some specific items, such as automobiles, and specific offense types, such as motor vehicle theft, are common in high-value cargo thefts. Scatter plots visualize the relationships between support, lift, and confidence, which provides insight into the strength of the association between the antecedents and the outcomes. For high and very high stolen values, specific locations (Maryland and the South Atlantic for high, Texas and West South Central for very high) and victim types (businesses for very high) are strongly associated with these incidents. It should be noted that business entities are more likely to be involved in very high-value cargo thefts, especially in Texas. Although the support is low, this does not mean the pattern is unimportant.
In fact, for cargo theft, the rarest patterns might be the most dangerous because they may indicate targeted, organized crime rather than random thefts. The rules above suggest organized, targeted cargo theft activity in these areas and point to a systemic issue, possibly due to valuable cargo routes or weak security measures. To reduce cargo theft in specific regions, stricter security protocols and coordinated efforts with law enforcement should be considered. The association rule results can help prioritize areas for intervention, such as focusing on certain cargo item categories or regions with a high likelihood of theft.
Cargo theft does not follow a uniform pattern across the entire dataset. The diversity of theft scenarios makes it hard for any single combination of factors to occur frequently enough to meet a high support threshold. Association rules can also be biased by overfitting, as seen in the rule mining for “Moderate” stolen values. Overfitting occurs when rules capture patterns that are too specific to generalize well, and the choice of support, confidence, and lift thresholds can encourage it. Thresholds that are too strict may overlook important relationships, while thresholds that are too loose generate too many weak rules. For example, the very low support threshold used for “Moderate” stolen values yields numerous rare but interesting patterns, but these may not be significant and can be difficult to interpret reasonably.
[1] Cargo Theft Definition - https://www.fbi.gov/investigate/transnational-organized-crime/cargo-theft
[2] Classification of Age Groups Based on Facial Features - https://www.researchgate.net/publication/228404297_Classification_of_Age_Groups_Based_on_Facial_Features
[3] Crime Analysis Based on Association Rules Using Apriori Algorithm - https://www.researchgate.net/publication/321338934_Crime_Analysis_Based_on_Association_Rules_Using_Apriori_Algorithm
[4] Application for Analysis and Prediction of Crime Data Using Data Mining - http://www.iraj.in/journal/journal_file/journal_pdf/3-253-14650168849-12.pdf
[5] Association Rules Mining in Crime Data Analysis - https://ieeexplore.ieee.org/document/10712467
[6] Prediction of Criminal Suspects Based on Association Rules and Tag Clustering - https://www.scirp.org/journal/paperinformation?paperid=91425